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tRNADB-CE: tRNA gene database well-timed in the era of 
big sequence data 

TakashiAbe ^ Hachiro Inokuchi^, Yuko Yamada^ , Akira Muto^, Yuki Iwasaki^ andToshimichi Ikemura^ 



The tRNA gene data base curated by experts "tRNADB-CE" (http://trna.ie.niigata-u.ac.jp) 
was constructed by analyzing 1 ,966 complete and 5,272 draft genomes of prokaryotes, 171 
viruses', 121 chloroplasts', and 12 eukaryotes' genomes plus fragment sequences obtained 
by metagenome studies of environmental samples. 595,115 tRNA genes in total, and thus 
two times of genes compiled previously, have been registered, for which sequence, clover- 
leaf structure, and results of sequence-similarity and oligonucleotide-pattern searches can 
be browsed. To provide collective knowledge with help from experts in tRNA researches, 
we added a column for enregistering comments to each tRNA. By grouping bacterial 
tRNAs with an identical sequence, we have found high phylogenetic preservation of 
tRNA sequences, especially at the phylum level. Since many species-unknown tRNAs 
from metagenomic sequences have sequences identical to those found in species-known 
prokaryotes, the identical sequence group (ISG) can provide phylogenetic markers to 
investigate the microbial community in an environmental ecosystem. This strategy can be 
applied to a huge amount of short sequences obtained from next-generation sequencers, 
as showing that tRNADB-CE is a well-timed database in the era of big sequence data. It is 
also discussed that batch-learning self-organizing-map with oligonucleotide composition is 
useful for efficient knowledge discovery from big sequence data. 
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INTRODUCTION 

After completion of genome sequencing, tRNA genes on each 
genome have been predicted using computer programs such as 
tRNAscan-SE (Lowe and Eddy, 1997), ARAGORN (Laslett and 
Canback, 2004), and tRNAfinder (Kinouchi and Kurokawa, 2006) 
and included in the flat file of the sequence data for registration 
to the International DNA Data Banks (DDBJ/ENA/GenBank). 
However, in approximately 5% of the completely sequenced 
genomes, the annotation data of tRNA genes have not been 
included. There also exist many cases where an important infor- 
mation of anticodon (and thus of amino acid) has not be 
added while the tRNA itself is noted. The compilation of tRNA 
sequences and tRNA genes has been published for the first 
time by Sprinzl etal. (1978) and has been updated and recon- 
structed the database including modified nucleosides (tRNAdb^; 
Juhling etal, 2009). Using tRNAscan-SE, the Genomic tRNA 
Database (GtRNAdb^) has been constructed for complete and 
near complete genomes (Chan and Lowe, 2009). In addition, a 
database for comprehensive listing of modified nucleotide (The 
RNA Modification Database^; Limbach etal, 1994) and that for 
modification pathways (MODOMICS^; Machnicka etal, 2012) 



^http://trnadb.bioinf.uni-leipzig.de/ 
^httpV/lowelab.ucsc.edu/GtRNAdb/ 
^http://mods.rna.albany.edu/home 
^ http : / / modomics . genesilico.pl/ 



have been reported. Mamit-tRNA^ is a compilation of mam- 
malian mitochondrial tRNA genes (Piitz etal., 2007). Another 
characteristic tRNA database is SPLITSdb^ constructed by Keio 
Group (Sugahara etal, 2006, 2008), which has compiled tRNA 
sequences in archaeal and primitive eukaryotic species for pro- 
moting next studies of tRNA evolution and processing; 671 
intron-containing and 12 split tRNAs have been registered. 
In the present Review Article, we introduce current status of 
our "tRNA gene database curated by experts (tRNADB-CE)" 
and compare it with other tRNA databases for explaining its 
characteristics. 

In accord with the remarkable progress of DNA sequencing 
technology, a vast quantity of partially sequenced draft genome 
sequences and metagenomic sequences obtained from a wide 
variety of environmental and clinical samples have been com- 
piled in DDBJ/ENA/GenBank. However, no information of tRNA 
genes has been added to draft and metagenomic sequences in 
DDBJ/ENA/GenBank and in GtRNAdb and tRNAdb. Metage- 
nomic sequences have attracted broad industrious interests; even 
short sequences obtained with new-generation sequencers con- 
tain a large number of full-length tRNAs because tRNA lengths 
are short. Search for tRNA genes in metagenomic sequences may 
provide a new strategy for clarifying the microbial community 



^http://mamit-trna.u-strasbg.fr/ 
^http://splits. iab.keio.ac.jp/splitsdb/ 



www.f rontiersin .org 



May 2014 | Volume 5 | Article 114 | 1 



Abe etal. 



tRNADB-CE 



in an environmental sample. This prediction is supported by the 
findings that examination of a group of tRNAs with an identical 
sequence obtained from species-known prokaryotes has revealed 
that such tRNAs belong primarily to a specific phylogenetic group 
and that the phylotype-specific tRNA sequences have also been 
found in species-unknown metagenomic sequences. This supports 
the view that these tRNA genes should become good phylogenetic 
markers for studying phylotype composition in an environmental 
ecosystem. It should be noted here that, when analyzing a dataset 
composed only of short fragmental sequences (e.g., 100 nt), 
genomic sequences other than tRNA genes appears to be diffi- 
cult to be properly used for phylogenetic assignments, except for 
dominant genomes for which contiguous sequences with a suf- 
ficient length for constructing reliable phylogenetic trees can be 
obtained. 

GENOMIC SEQUENCES ANALYZED AND METHODS OF 
SEARCH FOR tRNA GENES 

Because one important role of tRNADB-CE is a use of tRNA 
genes as phylogenetic markers in metagenomic studies, we have 
mainly focused on microbial genomes. The following sources 
of DNA sequences have been used for constructing the present 
version of tRNADB-CE: 1966 and 171 complete prokaryote and 
virus genomes released by DDBJ/EMBL/GenBank up to Septem- 
ber 2012, 121 complete chloroplast genomes released by organelle 
genome database (GOBASE^) up to March 2009, 5272 prokaryote 
draft genomes released by WGS division in DDBJ/ENA/GenBank 
up to September 2012, 12 eukaryote complete genomes, 17 mil- 
lion metagenomic sequences released by DDBJ/ENA/GenBank up 
to March 2012, and 217 million metagenomic sequences obtained 
using next-generation sequencers and released by sequence read 
achieve (SRA^) in NCBI up to March 2012. 

To enhance the completeness and accuracy of searching 
for tRNA genes, three computer programs, tRNAscan-SE, 
ARAGORN, and tRNAfinder have been used in combination, 
since their algorithms are partially different and have rendered 
somewhat different results. The tRNA genes found concordantly 
by all three programs were stored in tRNADB-CE after brief 
manual checking for cases of non-standard anticodons, such 
as bacterial A- starting anticodons (except for Arg) and those 
responding to termination codons, as explained later. Discordant 
cases among the three programs (approximately 5% of the total 
of tRNA gene candidates predicted by at least one program) were 
manually checked by three experts (YY, AM, and HI) in tRNA 
experimental fields and were classified into three categories: (A) 
reliable tRNA genes, (B) not tRNA genes, and (C) ambiguous 
cases. 

The tRNA genes of Archaea obtained from SPLITSdb^ con- 
structed by Keio Group (Sugahara etal, 2006, 2008) are included 
in the current version of tRNADB-CE. Basic functions of the 
database have been described previously (Abe etal., 2009, 2011). 
For fragment sequences obtained by metagenome studies, only 
tRNA genes found concordantly by the three programs and those 



^http://gobase.bcm.umontreal.ca/ 

^ http : // www. ncbi . nlm . nih . gov/Tr aces/sr a/ 

^http://splits. iab.keio.ac.jp/splitsdb/ 



with sequences identical to the tRNAs already included in the 
database were stored. Many tRNA genes were detected in vari- 
ous environmental samples, and the number was registered for 
each environment sample. This result has enabled us to predict 
the microbial population in an environmental ecosystem by using 
tRNAs as phylogenetic markers. Since a significant portion of envi- 
ronmental sequences is thought to be derived from unculturable 
microbes, tRNA genes of novel microbes should be included. 

tRNA GENES COMPILED IN THE CURRENT VERSION AND A 
NEW FUNCTION FOR ORGANIZING COLLECTIVE KNOWLEDGE 

In the present database, 595,115 tRNA genes in total (112266, 
317508, 961, 3534, 4137, and 156709 genes from 1966 complete 
and 5272 draft prokaryote genomes, 171 viruses, 121 chloro- 
plasts, 12 eukaryotes, and 221 metagenomic samples, respectively) 
have been registered. This number is two times as many genes as 
were previously registered (Abe etal., 2011) and functions of the 
database are listed in Figure 1. Comparison of registered data and 
functions among three tRNA databases is presented in Table 1. 
More than ten and five times of tRNAs have been compiled by 
tRNADB-CE than by tRNAdb and GtRNAdb, respectively This is 
mainly because tRNADB-CE has included tRNAs obtained from 
draft genome and metagenome sequences, as well as virus and 
plastid sequences. Another important difference of tRNADB-CE 
from others is the use of three computer programs for tRNA gene 
search. 

The nucleotide frequency in each position on the tRNA 
cloverleaf structure is presented on a statistics page, providing 
information on consensus nucleotides for each anticodon of indi- 
vidual data types (Figure lA). Since a vast amount of tRNAs has 
been registered, this is useful for systematic search for identifier 
nucleotides, which interact with aminoacyl-tRNA synthetases. To 
aim at creating a high quality database by collecting knowledge 
in various tRNA research fields, we developed a new function for 
including comments on each gene in "The detailed information 
of tRNA gene sequence page" (Figure ID). Users can add com- 
ments after typing in their e-mail address and password, while we 
reserve the right to remove irrelevant comments. We hope that the 
accumulation of user's comments will provide annotations with 
high quality and this database will become an information sharing 
system in the tRNA research community. 

IDENTICAL SEQUENCE GROUPS AND THEIR USE AS 
PHYLOGENETIC MARKERS 

We investigated a group of tRNAs with an identical sequence and 
found that one group was composed primarily of tRNAs derived 
from species belonging to a certain phylogenetic group, i.e., 
phylotype-specific tRNAs. To ymfy this phylogenetic preservation 
of tRNA sequences in more detail, we conducted the clustering of 
429,774 tRNA sequences derived from 7,237 prokaryote genomes 
through CD-HIT sequence alignment (Li and Godzik, 2006). Then 
we designated the group with an identical sequence as "Identical 
Sequence Group: ISG". When focusing on ISGs composed of more 
than five genes, 95% of ISGs have been conserved at the phylum 
level, showing that most tRNA sequences are good phylogenetic 
markers at least at the phylum level. In addition, approximately 
65% of ISGs have been conserved at the genus level, indicating 
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keyword and pattern search. 



that more than half of ISGs may be usable as genus-specific mak- 
ers. This is because tRNA sequences have been stably conserved 
during evolution. Increases in the number of genes in each ISG 
in the near future will progressively clarify the phylogenetic range 
that can be covered by one ISG. By combining the data provided 
by this database with other knowledge obtained from experiments 
or literatures, users can choose useful phylogenetic markers (e.g., 
genus-specific markers) by themselves. Our group has started 
to search phylogenetic markers for the genomes that are rare in 
regular environments and will publish these markers in the afore- 
mentioned column for comments to each tRNA gene. It should be 
noted here that horizontal gene transfer between different species 
is a general characteristic of microbial genomes. Therefore, we 
may not find phylogenetic markers with the 100% accuracy. Any 
informatics method, including sequence homology searches, most 
likely assigns the horizontally transferred elements to the donor 
but not the recipient genome. When certain tRNAs that have pri- 
marily found in a certain phylotype have been additionally found 
in the phylogenetically distant but restricted species, these tRNAs 



may represent genes that have been transferred horizontally to the 
restricted species or products of the convergent evolution. Phylo- 
genetic marker tRNAs have to be used in consideration of these 
points. 

Interestingly, approximately 25% of tRNA genes obtained from 
metagenomic sequences were found identical in sequence to genes 
from species-known prokaryotes and assignable to ISGs. By using 
these assigned tRNAs as phylogenetic markers, we have predicted 
the microbial community in an environmental ecosystem at a phy- 
lum level (Figure 2 A) . The number of tRNA genes found in known 
species and a list of the species can be browsed for each environ- 
mental sample (Figures 2B,C). When the numbers are clicked, the 
list of metagenomic tRNAs assignable to phylum can be accessed 
(Figure 2D). This strategy can be applied even to the data of 
short sequences obtained with new- generation sequencers, such 
as metagenomic sequences in the SRA^^ at NCBI. It is notewor- 
thy that phylogenetic clustering of short sequences (e.g., 100 nt) 



^ http : / / www. ncbi. nlm. nih. gov/ sra 
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Table 1 | Comparison of registered data and functions between databases (at IVIarch, 2014). 
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with conventional sequence homology searches is inadequate for 
studying a microbial community in an ecosystem. However, evo- 
lutionary stable tRNA genes can be used as effective phylogenetic 
markers for predicting the microbial community, since full-length 
tRNAs can be found even from short genomic fragments obtained 
with new- generation sequencers. 

The database is also capable of searching for sequences with 
1-3 nt differences (Figure 2A). By using tools in the database as 
well as phylogenetic markers found by users (e.g., genus-specific 
markers), users can search for very specific and rare genomes 
of the user's attention from a wide range of environmental 
samples. 

MINIMUM ANTICODON SET AND NON-STANDARD-TYPE 
tRNAs 

An important curation process by experts was to investigate the 
minimum anticodon set most likely to be essential for the trans- 
lation system of each bacterial species (Osawa, 1995), because 
three computer programs did not necessarily assign this set con- 
cordantly. When we could not find the minimum set in our 
standard search, we reexamined the candidates predicted by one 
or two programs and searched for the most probable candidate 
that is satisfactory to the minimum set, according to various 



criteria such as identifiers for respective anticodons and refer- 
ring to literatures of experiments. We also checked whether 
an identical sequence and the sequences with 1-2 nt differ- 
ences were present in closely related genomes, basing on the 
view that a functionally active gene should be stably maintained 
during evolution. This final check has become increasingly use- 
ful because many closely related species have been sequenced 
currently. The search for the minimum anticodon set in each 
genome can assign tRNAs whose sequences appear to be rather 
non-standard. If non-standard-type tRNAs have been found iter- 
atively in species belonging to a special phylogenetic group, such 
tRNAs will become good phylogenetic markers with high speci- 
ficity. The aforementioned newly added column for comments 
on each tRNA can be used for mentioning this phylogenetic 
information. 

We next explain one example of non-standard- type tRNAs 
more in detail. The Anticodon Table of bacterial tRNAs 
(Figure 3 A) points out that occurrences of A- starting anticodons 
are very rare except for Arg, Leu, and Thr: 0 for Phe, Tyr, Cys, 
Val, Asp, and Gly; 1 for His, He, and Asn; 3 for Ser and 6 for Pro. 
We have manually checked tRNA candidates with these very rare 
A- starting anticodons, on the basis of the aforementioned criteria 
for the minimum anticodon set and/or the iterative occurrences 
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in the closely related species. Information concerning such non- 
standard-type tRNAs can promote experimental studies to prove 
predicted tRNA candidates at the RNA level. For example, we 
have found novel bacterial tRNA genes with TAT anticodon in 
19 species, including Mycoplasma mobile. In many bacteria, ATA 
codons are deciphered by tRNA^^^^ bearing lysidine (L) at the wob- 
ble position (Soma etal., 2003; Suzuki and Miyauchi, 2010); L is a 
modified cytidine introduced post-transcriptionally by tRNAIle- 
lysidine synthetase (TilS). Some bacteria, including M. mobile, do 
not carry the tilS gene, indicating that they have a specific system 



to decode ATA codons. Taniguchi et al. (2013) have experimentally 
shown at the cellular RNA level that the M. mobile tRNA^^^^ reg- 
istered in our database (Figure 3B) contains an unmodified UAU 
anticodon. 

tRNA DATABASE WELL-TIMED IN THE ERA OF BIG 
SEQUENCE DATA 

One important characteristic of tRNADB-CE is inclusion of 
tRNAs found in a large number of partially sequenced genomes. 
The aforementioned search for the minimum anticodon set can be 
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FIGURE 3 I Examples of non-standard-type tRNAs. (A) The anticodon occurrence table of bacterial tRNAs. It is clear that A-starting anticodons are very rare 
in bacterial tRNAs. (B) Cloverleaf secondary structure of M. mobile tRNA"®^. 



applied only to a complete genome. Another check that is applica- 
ble even to partially sequenced genomes is to examine whether 
non-standard-type candidates have been repeatedly found in 
closely related species, and this process has become increasingly 
useful because genomes of many closely related species, even of 
different strains belonging to one species, have been sequenced. 
When the same or almost the same sequence has been found 
repeatedly, we have included the sequence in the Reliable cate- 
gory basing on the knowledge that functionally important genes 
have been stably maintained throughout evolution. Importantly 
this check process can be applied even to metagenomic sequences; 
if the non-standard-type candidates have been found repeatedly 
in metagenomic samples, especially in different environmental 
samples, these can be included in the Reliable category. Accumu- 
lation of a massive number of sequences in the near future should 
further increase the usefulness and reliability of this strategy, indi- 
cating that tRNADB-CE is a well-timed database in the era of big 
sequence data. 

ENHANCING QUALITY OF A LARGE-SCALE DATABASE AND 
EFFICIENT KNOWLEDGE DISCOVERY 

The number of tRNA genes compiled has already become huge 
and will undoubtedly increase more rapidly in the near future. 
For efficient knowledge discovery from such big data, analysis 
tools presently available are inadequate, and new tools suitable 
for big data analyses should be urgently required. Our group 



has developed a bioinformatics tool "BLSOM (Batch- Learning 
Self-Organizing-Map)" with oligonucleotide compositions (Abe 
etal, 2003), which can analyze more than one million genomic 
sequences simultaneously and allows us to efficiently acquire 
a wide range of knowledge from big sequence data. Oligonu- 
cleotides, such as penta- and heptanucleotides, often represent 
motif sequences responsible for sequence-specific protein bind- 
ing (e.g., transcription factor binding). Occurrences of such 
motif oligonucleotides should differ from occurrences expected 
from the mononucleotide composition in each genome and may 
differ among genomic portions within a single genome. Actu- 
ally, we have recently found that a pentanucleotides BLSOM 
for the human genome can detect characteristic enrichment of 
many transcription-factor-binding motifs in pericentric hete- 
rochromatin regions (Iwasaki etal., 2013a), showing that BLSOM 
can effectively detect the characteristic and combinatorial occur- 
rences of functional motif oligonucleotides in the genomic 
sequence. Each tRNA has the characteristic and combinatorial 
occurrences of motif oligonucleotides, which are required to 
fulfill its function (e.g., binding to proper enzymes) and struc- 
tural formation (e.g., clover leaf). To examine the usefulness 
of BLSOM for knowledge discovery from a huge amount of 
tRNAs, we have recently constructed a BLSOM with the pen- 
tanucleotide composition in all bacterial tRNAs in tRNADB-CE. 
Interestingly, tRNAs are separated primarily according to amino 
acid (should be published elsewhere), showing that the BLSOM 
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can detect the characteristic combinations of motif oUgonu- 
cleotides required for proper recognition by various enzymes, 
including aminoacyl-tRNA synthetases. Bacterial tRNAs with 
the same anticodon form their own territories, indicating this 
BLSOM can be used for an informatics method for assign- 
ing reliable tRNAs. For creating a high quality database, it 
is important to find errors that are slipped into the database, 
including those attributable to sequencing errors. An orphan 
tRNA located apart from the proper anticodon territories on 
BLSOM may be a candidate for erroneous cases, for which 
manual checks by experts have to be conducted. This type of 
new informatics strategies is required to find errors present in 
the huge amount of predicted tRNAs, especially in those pre- 
dicted concordantly by all three programs and thus included in 
the database with no manual check. More importantly, BLSOM 
is an unsupervised clustering method with a strong visualiza- 
tion power (Abe etal., 2003) and this unsupervised data-mining 
method can be used for efficient knowledge discoveries from big 
sequence data (Iwasaki etal., 2013b). Informatics tools suitable 
for big data analyses have to be introduced to extract a wide 
range of knowledge from large-scale databases in the era of big 
data. 
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